Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Li, Changlin, Zhang, Jiawei, Liu, Shuhao, Lin, Sihao, Shi, Zeyi, Li, Zhihui, Chang, Xiaojun

arXiv.org Artificial Intelligence

Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption of training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components for the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that increases computational complexity during training based on measured convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.
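The adaptive progressive schedule can be pictured with a small sketch. The stage names, window size, and threshold below are illustrative assumptions, not the paper's actual criterion; the idea is simply to advance to a more expensive training stage once the recent rate of loss improvement stalls.

```python
# Hypothetical sketch of an adaptive progressive schedule: training complexity
# (here, a stage label standing in for resolution/frame count) is increased
# when convergence efficiency -- the recent rate of loss improvement -- stalls.

def convergence_efficiency(losses, window=5):
    """Average per-step loss improvement over the last `window` steps."""
    if len(losses) < window + 1:
        return float("inf")  # too early to judge; keep training
    recent = losses[-(window + 1):]
    return (recent[0] - recent[-1]) / window

def progressive_schedule(loss_stream, stages, threshold=0.01):
    """Walk through training stages, advancing when improvement stalls."""
    stage_idx, losses, history = 0, [], []
    for loss in loss_stream:
        losses.append(loss)
        history.append(stages[stage_idx])
        if convergence_efficiency(losses) < threshold and stage_idx < len(stages) - 1:
            stage_idx += 1
            losses = []  # reset statistics for the new, harder stage
    return history

# Toy loss curve: fast improvement, then a plateau that triggers a stage switch.
losses = [1.0 - 0.05 * i for i in range(10)] + [0.5] * 10
print(progressive_schedule(losses, stages=["64px", "128px", "256px"]))
```

In a real pipeline the stage switch would change the data resolution or frame count fed to the model, so early training runs on cheap inputs and the expensive configuration is reserved for the end.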


First-Order Adaptive Sample Size Methods to Reduce Complexity of Empirical Risk Minimization

Aryan Mokhtari, Alejandro Ribeiro

Neural Information Processing Systems

This paper studies empirical risk minimization (ERM) problems for large-scale datasets and incorporates the idea of adaptive sample size methods to improve the guaranteed convergence bounds for first-order stochastic and deterministic methods. In contrast to traditional methods that attempt to solve the ERM problem corresponding to the full dataset directly, adaptive sample size schemes start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically - e.g., scaled by a factor of two - and the solution of the previous ERM problem is used as a warm start for the new one. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. The gains are specific to the choice of method. When particularized to, e.g., accelerated gradient descent and stochastic variance reduced gradient (SVRG), the computational cost advantage is a logarithm of the number of training samples. Numerical experiments on various datasets confirm theoretical claims and showcase the gains of using the proposed adaptive sample size scheme.
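The geometric sample-size growth with warm starts can be illustrated on a toy ERM problem; the quadratic loss, learning rate, and tolerance below are illustrative choices for the sketch, not the paper's analysis.

```python
import numpy as np

# Illustrative sketch of the adaptive sample size scheme on a toy ERM problem:
# estimate the mean of a dataset by gradient descent, doubling the sample size
# at each stage and warm-starting from the previous stage's solution.

def solve_erm(data, w0, lr=0.5, tol=1e-4):
    """Gradient descent on the quadratic ERM loss mean((w - x)^2) / 2."""
    w, steps = w0, 0
    while abs(grad := w - data.mean()) > tol:
        w -= lr * grad
        steps += 1
    return w, steps

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1024)

w, total_steps, n = 0.0, 0, 2
while n <= len(data):
    w, steps = solve_erm(data[:n], w)  # warm start from the previous solution
    total_steps += steps
    n *= 2                             # grow the sample size geometrically

print(f"final estimate {w:.3f}, true ERM solution {data.mean():.3f}")
```

Because each stage's solution is already within statistical accuracy of the next stage's optimum, every ERM subproblem needs only a few iterations, which is the intuition behind the logarithmic cost advantage claimed above.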


Models Got Talent: Identifying High Performing Wearable Human Activity Recognition Models Without Training

Goldman, Richard, Komperla, Varun, Ploetz, Thomas, Haresamudram, Harish

arXiv.org Artificial Intelligence

A promising alternative to computationally expensive Neural Architecture Search (NAS) involves Zero-Cost Proxies (ZCPs), which correlate well with trained performance but can be computed through a single forward/backward pass on a randomly sampled batch of data. In this paper, we investigate the effectiveness of ZCPs for Human Activity Recognition (HAR) on six benchmark datasets, and demonstrate that they discover network architectures that come within 5% of the performance attained by full-scale training of 1,500 randomly sampled architectures. This results in substantial computational savings, as high-performing architectures can be discovered with minimal training. Our experiments not only introduce ZCPs to sensor-based HAR, but also demonstrate that they are robust to data noise, further showcasing their suitability for practical scenarios.
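As an illustration of the single forward/backward-pass idea, the sketch below scores randomly initialised networks with a generic gradient-norm proxy; this is an assumed, simplified ZCP, not necessarily one of the proxies evaluated in the paper, and the one-hidden-layer architecture and random batch are toy stand-ins.

```python
import numpy as np

# A minimal "grad-norm"-style zero-cost proxy: score a randomly initialised
# one-hidden-layer network by the gradient norm from a single forward/backward
# pass on one batch, with no training at all.

def grad_norm_proxy(hidden, batch, targets, rng):
    d = batch.shape[1]
    W1 = rng.normal(0, 1 / np.sqrt(d), (d, hidden))
    W2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, 1))
    # forward pass
    h = np.maximum(batch @ W1, 0.0)   # ReLU activations
    pred = h @ W2
    err = pred - targets              # MSE gradient at the output
    # backward pass (manual, so the sketch stays dependency-free)
    gW2 = h.T @ err / len(batch)
    gh = (err @ W2.T) * (h > 0)
    gW1 = batch.T @ gh / len(batch)
    return np.sqrt((gW1 ** 2).sum() + (gW2 ** 2).sum())

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 8))
targets = rng.normal(size=(32, 1))
# Rank candidate widths by proxy score instead of training each candidate.
scores = {h: grad_norm_proxy(h, batch, targets, rng) for h in (4, 16, 64)}
print(scores)
```

The search then keeps only the top-scoring candidates for (optional) full training, which is where the computational savings come from.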


ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

Yuan, Zhuowen, Liu, Tao, Yang, Yang, Wang, Yang, Qi, Feng, Rangadurai, Kaushik, Li, Bo, Yang, Shuang

arXiv.org Artificial Intelligence

Recent LLM-based agents have demonstrated strong capabilities in automated ML engineering. However, they rely heavily on repeated full training runs to evaluate candidate solutions, resulting in significant computational overhead, limited scalability to large search spaces, and slow iteration cycles. To address these challenges, we introduce ArchPilot, a multi-agent system that integrates architecture generation, proxy-based evaluation, and adaptive search into a unified framework. ArchPilot consists of three specialized agents: an orchestration agent that coordinates the search process using a novel Monte Carlo Tree Search (MCTS)-inspired algorithm with a restart mechanism and manages a memory of previous candidates; a generation agent that iteratively generates, improves, and debugs candidate architectures; and an evaluation agent that executes proxy training runs, generates and optimizes proxy functions, and aggregates the proxy scores into a fidelity-aware performance metric. This multi-agent collaboration allows ArchPilot to prioritize high-potential candidates with minimal reliance on expensive full training runs, facilitating efficient ML engineering under limited budgets. Experiments on MLE-Bench demonstrate that ArchPilot outperforms SOTA baselines such as AIDE and ML-Master, validating the effectiveness of our multi-agent system.
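The MCTS-inspired prioritization of candidates under cheap proxy evaluations can be caricatured as a bandit loop; the UCB rule, noise model, and constants below are assumptions for illustration only, and the sketch omits ArchPilot's restart mechanism and multi-agent structure.

```python
import math
import random

# Each "architecture" is an arm whose pulls are cheap, noisy proxy evaluations
# rather than full training runs; UCB concentrates the budget on promising arms.

def ucb_search(proxy_means, budget=500, c=0.5, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(proxy_means)
    sums = [0.0] * len(proxy_means)
    for t in range(1, budget + 1):
        # pick the arm with the highest upper confidence bound
        best = max(
            range(len(proxy_means)),
            key=lambda i: float("inf") if counts[i] == 0
            else sums[i] / counts[i] + c * math.sqrt(math.log(t) / counts[i]),
        )
        reward = proxy_means[best] + rng.gauss(0, 0.05)  # noisy proxy eval
        counts[best] += 1
        sums[best] += reward
    return counts.index(max(counts))  # most-evaluated candidate

# Three candidate architectures with hidden proxy quality 0.3, 0.5, 0.8.
print(ucb_search([0.3, 0.5, 0.8]))
```

The point of the caricature is the budget allocation: weak candidates receive only a handful of proxy evaluations, while the strong one absorbs almost the entire budget before any full training is attempted.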


Appendix for "Efficient Low-rank Backpropagation for Vision Transformer Adaptation": A. More Experimental Results for Full Training in Table 2, Section 4.2

Neural Information Processing Systems

Table 5 shows more results for training the entire model; these results further demonstrate the effectiveness of our LBP-WHT approach. In Table 5, "Hybrid" denotes the CNN-ViT-hybrid EfficientFormer L1 architecture, and any results with higher speedup or mAcc than full backpropagation (Full BP) are highlighted in bold. LoRA, on the other hand, efficiently reduces the memory usage needed to store the weight gradients. These results confirm the effectiveness of our method. As shown in Table 7, our method also scales well on large-scale datasets.



Parsimonious Dataset Construction for Laparoscopic Cholecystectomy Structure Segmentation

Zhou, Yuning, Badgery, Henry, Read, Matthew, Bailey, James, Davey, Catherine

arXiv.org Artificial Intelligence

Labeling has always been expensive in the medical context, which has hindered related deep learning applications. Our work introduces active learning into surgical video frame selection to construct a high-quality, affordable Laparoscopic Cholecystectomy dataset for semantic segmentation. Active learning allows the Deep Neural Network (DNN) learning pipeline to include the dataset construction workflow: DNNs trained on the existing dataset identify the most informative data among the newly collected data. At the same time, the DNNs' performance and generalization ability improve over time as the newly selected and annotated data are included in the training data. We assessed different data informativeness measurements and found that deep feature distances select the most informative data in this task. Our experiments show that with half of the data selected by active learning, the DNNs achieve almost the same performance on the critical anatomies and surgical instruments, with 0.4349 mean Intersection over Union (mIoU), as the same DNNs trained on the full dataset (0.4374 mIoU).
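One common way to operationalize "deep feature distances" for informativeness is farthest-point (k-center greedy) selection; the sketch below assumes that criterion, with random vectors standing in for DNN features, and is not necessarily the exact measurement used in this work.

```python
import numpy as np

# Greedily pick the frame whose (deep) feature vector is farthest from
# everything already selected -- k-center greedy / farthest-point sampling.
# Real features would come from a trained DNN; random vectors stand in here.

def select_informative(features, k):
    chosen = [0]                                   # seed with the first frame
    dists = np.linalg.norm(features - features[0], axis=1)
    while len(chosen) < k:
        nxt = int(dists.argmax())                  # farthest from selected set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))                 # 100 frames, 16-dim features
picked = select_informative(feats, k=10)
print(picked)
```

Only the selected frames are sent for expensive expert annotation, and the updated model's features drive the next selection round.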


Lightweight Dataset Pruning without Full Training via Example Difficulty and Prediction Uncertainty

Cho, Yeseul, Shin, Baekrok, Kang, Changmin, Yun, Chulhee

arXiv.org Artificial Intelligence

Advancements in deep learning have been significantly driven by large-scale datasets. However, recent studies have revealed a power-law relationship between the generalization capacity of deep neural networks and the size of their training data (Gordon et al., 2021; Hestness et al., 2017; Rosenfeld et al., 2019), meaning that improvements in model performance become increasingly cost-inefficient as the dataset size scales up. Fortunately, Sorscher et al. (2022) demonstrate that the power-law scaling of error can be reduced to exponential scaling with Pareto-optimal data pruning. The main goal of dataset pruning is to identify and retain the most informative samples while discarding redundant data points when training neural networks. This approach can alleviate storage and computational costs and improve training efficiency. However, many existing pruning methods require training a model on the full dataset for a number of epochs to measure the importance of each sample, which ironically makes the pruning process more expensive than simply training the model once on the original large dataset. For instance, several score-based methods (Gordon et al., 2021; He et al., 2024; Pleiss et al., 2020; Toneva et al., 2018; Zhang et al., 2024) require training, as they utilize the dynamics of the whole training process. Some geometry-based methods (Xia et al., 2022; Yang et al., 2024) leverage features from the penultimate layer of the trained model, so training a model is still required.
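A minimal sketch of the uncertainty side of such scoring, assuming softmax entropy as the uncertainty measure and random logits standing in for a lightly trained model's predictions (the paper's actual scores combine difficulty and uncertainty; this shows only the generic mechanism):

```python
import numpy as np

# Score each sample by the entropy of the model's predicted class distribution
# and keep the most uncertain fraction, discarding easy/redundant points.

def predictive_entropy(logits):
    p = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    p /= p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def prune(logits, keep_fraction=0.5):
    scores = predictive_entropy(logits)
    k = int(len(logits) * keep_fraction)
    return np.argsort(scores)[-k:]      # indices of most uncertain samples

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))    # 1000 samples, 10 classes
kept = prune(logits, keep_fraction=0.5)
print(len(kept))
```

The crucial cost property is that the logits come from a cheap, partially trained model, so no full training run is needed before pruning.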

